ABSTRACT


The current study explores the ability to predict argumentative 
claims in structurally-annotated student essays to gain insights into 
the role of argumentation structure in the quality of persuasive 
writing. Our annotation scheme specified six types of 
argumentative components based on the well-established 
Toulmin’s model of argumentation. We developed feature sets 
consisting of word count, frequency data of key n-grams, 
positionality data, and other lexical, syntactic, semantic features 
based on both sentential and suprasentential levels. The 
suprasentential Random Forest model based on frequency and 
positionality features yielded the best results, reporting an accuracy 
of 0.87 and kappa of 0.73. This model will be included in an online 
writing assessment tool to generate feedback for student writers. 
1. INTRODUCTION


Written argumentation has been an important area of study for 
many years [43, 45]. Recent developments in natural language 
processing (NLP) have introduced new approaches to 
automatically detect the discourse structure of argumentative 
essays [7, 8, 9, 10, 26, 33, 34, 38, 44, 45]. These studies have shown 
that content (i.e., lexical, syntactic, and semantic) and structural 
features (i.e., the positionality of tokens, sentences, and paragraphs) 
are effective in detecting discourse elements. 


Researchers have used fixed discourse markers at the word and 
phrase levels [5, 12, 18, 42] as indicators of different argumentative 
structures. This approach has been applied in discourse [17, 19, 22] 
and NLP analyses [7, 8, 9, 47]. These studies generally identify 
relations between discourse markers and their functions according 
to the conceptual framework of conjunctive relations [36]. For 
instance, phrases such as in summary and in conclusion are 
associated with the discourse function of ‘summarizing’ an 
argument. Such discourse markers have been used to identify the 
attributes of the structural elements in argumentative essays [8, 9,
36, 37]. For example, Burstein et al. [7] annotated structural 
information of argumentative essays collected from TOEFL, GRE, 
and GMAT. Discourse markers indicating each of the 
argumentative functions were extracted automatically from the 
essays. A word list that contained the discourse markers and their 
corresponding argumentative functions was formed and used to 
automatically predict instances of argumentation. Similarly, Palau
and Moens [37] implemented a context-free, rule-based approach
for argumentation mining in legal texts. They developed rules
based on common expressions encountered in legal documents,
such as for these reasons and in light of all the material, and on
discourse markers such as however or furthermore.
Using this approach, they obtained an accuracy of approximately 0.6
in detecting argumentation structures and an F1-measure of
around 0.7 for recognizing premises and conclusions in
legal texts.


In more recent work, Stab and Gurevych [44, 45] provided publicly 
available corpora comprising students’ argumentative essays and 
annotation guidelines for parsing argumentations. In these corpora, 
the essays were annotated based on three major argumentative 
categories: major claim, claim, and premise. They then used lexical, 
structural, syntactic, discourse markers, and other features to 
identify argument components. The lexical features consisted of 
binary lemmatized unigrams and the 2,000 most frequent bigrams 
extracted from a training corpus. The structural features captured 
the position of components in the text and the number of tokens in 
those components. Discourse markers included logical connectives 
such as therefore, thus, or consequently and the use of first-person 
pronouns (which indicated major claims). The syntactic features 
included part of speech (POS) distributions, number of sub-clauses, 
and the tense of the main verb. Using support vector machine
models, Stab and Gurevych [45] found that a combination of all
these features yielded an F1 score of 0.77. Khatib et al. [3]
employed a classifier for argumentativeness based on the research
in [37, 44, 45] and evaluated its performance on student essays
from [44]. Khatib et al. used n-grams, syntax, discourse markers, and
part of speech (POS) features of an argument. Their results
indicated that n-grams, POS tags, and syntax features yielded
accuracies of 0.64, 0.62, and 0.59, respectively, in classifying
arguments in students' essays, while the full feature set model
yielded an accuracy of 0.67. However, only unigrams through
trigrams were included in the POS features.


Though the use of discourse markers, n-grams and POS as 
indicators has been common in the detection of argumentative 
elements, few studies have examined whether using longer 
sequences of n-grams (beyond tri-grams) and their POS tags would
contribute to identifying argumentative features. We also note that 
other types of linguistic features related to lexical, structural, 
cohesion, and affective features were not tested in previous studies 
[e.g., 37, 45]. Therefore, this study explores a wider range of NLP 
features, and examines their contribution to model accuracy. We do 
so specifically on a corpus of student essays annotated on 
theoretically-aligned classifications of argumentative elements 
expected in academic settings. This is in contrast to most of the 
existing corpora in English that are annotated for argumentative 
structures and are from the domains of law [e.g., 4, 37], biology and 
medicine [e.g., 20], and user-generated content, e.g., Wikipedia 
articles or debate data, see [1, 2, 27, 41]. Few corpora [44, 45] have 
been developed for argumentation mining in the educational 
settings. In this study, we build on Stab and Gurevych’s work [44, 
45] by developing a structurally annotated corpus based on the 
Toulmin model [46] of argumentation that better reflects the 
structure of student essays. Our objective is for the corpus — and the 
models of argumentation developed from the corpus — to contribute 
to the development of writing assessment tools that can deliver 
useful feedback to student writers. 


Thus, in this study, we introduce a new corpus of essays annotated 
for argumentative features. We then develop NLP approaches to 
automatically identify claims in structurally annotated 
argumentative essays using length, frequency data of significant n- 
grams and POS tags, positionality data, and a wide range of lexical, 
syntactic, cohesion, and cognitive features extracted from a number 
of NLP tools [14, 15, 25, 24]. We compared the identification 
accuracy of multiple machine learning classifiers using different 
types of derived features at different levels (based on sentences or 
argumentative elements that are suprasentential). Our goal is to 
better understand whether and how the selection of the linguistic 
features, the level of units for identification (both sentential and 
suprasentential), and the choice of classifiers influence the 
accuracy of claim identification. Finally, we conduct an error 
analysis of the best model and discuss the distribution of the 
misclassification instances and related features. This study is 
guided by the following research question: 


To what extent do length, frequency of significant n-grams (and 
POS tags of n-grams), lexical, syntactic, and semantic features, and 
positionality predict argumentative claims in essays? 


2. METHOD 
2.1 Corpus 


For the analysis, we annotated 314 persuasive essays. The essays 
were written by undergraduate students (N = 314) at a public
university in the United States who were native speakers of English. 
Two prompts from retired test banks of the Scholastic Assessment 
Test (SAT) were used. The prompts were counterbalanced such that 
half of the students wrote about ‘originality and uniqueness’ while 
the other half wrote about ‘heroes versus celebrities.’ All essays 
had been scored previously by expert raters for holistic writing 
quality. For each essay, we extracted the average number of letters
per word, the number of words, the number of types, the type-token
ratio, the average number of words per sentence, and the numbers of
sentences and paragraphs. Descriptive statistics for these measures
across the 314 essays are reported in Table 1.


¹ https://www.tagtog.net


Table 1. Descriptive statistics of the persuasive essays


Measure                 Mean     SD       Median   Range
Letters per word        4.52     0.24     4.51     1.50
Number of words         354.46   118.20   344.00   680.00
Number of types         178.17   50.01    173.00   279.00
Type-token ratio        0.52     0.07     0.52     0.41
Words per sentence      17.74    4.30     17.06    35.08
Number of sentences     20.65    7.42     20.00    48.00
Number of paragraphs    3.86     1.38     4.00     7.00


2.2 Annotation of argumentative elements 

The essays were structurally annotated by normed raters for 
argumentative elements. We used the modified Toulmin models 
[46] presented in [35] and [30] as the basis for the annotation rubric. 
The rubric adopted six elements (i.e., micro-categories) as the 
building blocks of the argumentation framework: Final Claim, 
Primary Claim, Counterclaim, Rebuttal, Data, and Concluding 
Summary. The definitions of each of these elements are presented 
in Appendix A. 


The essays were coded by two annotators on the web-based text 
annotation platform ‘Tagtog’¹. The two annotators were both native
speakers of English and were undergraduate students majoring in 
applied linguistics at a public university in the United States. Before 
independent annotation, a norming process was conducted to help 
ensure consistency in annotations. Once normed, the two 
annotators worked independently and coded the 314 essays in the 
opposite order to avoid recency effects. 


The two annotators made decisions on both the boundary of an 
argumentative element and the category of the element. An 
argumentative element was inherently suprasentential (i.e., 
according to the annotation scheme derived from the norming 
session, it could contain one or more sentences, and the content 
could be over the span of paragraphs). Inter-rater reliability 
calculated using Fleiss’s Kappa for all the annotations was 0.584 (p 
< 0.001), indicating fair to good agreement [16]. Disagreements of 
either boundary or category of the argumentative elements between 
the two annotators were adjudicated by an expert adjudicator who 
had years of experience teaching and conducting writing research. 
In the case of disagreement, the expert adjudicator compared the 
annotations from both annotators and made the final decision for 
both the boundary and the category of the argumentative element. 


The current study focuses on the identification of claims versus 
non-claims, mainly because of the small sample size of the corpus 
and the distribution of micro-categories. Thus, we combined the 
categories of Final Claim, Primary Claim, Counterclaim, and 
Rebuttal into a single category of claims. The remaining categories 
of Data and Concluding Summary were classified as non-claims, as
was any non-annotated text. 
2.3 Training and test sets

Annotation of the data led to the classification of 2264 
argumentative elements. As mentioned in Section 2.2, the 
argumentative elements were inherently suprasentential. We 
further split the elements into sentences to determine whether this 
influenced accuracy. All sentences from the same argumentative 
element were given the same annotation as the original category 
(i.e., claims or non-claims). We thus had two data sets: 1) a 
sentence-tokenized data set (N = 6326) and 2) a suprasentential
data set (N = 2264). We randomly selected 70% of the 
argumentative elements as the training set, and the remaining 30% 
of the elements as the test set for both datasets. We report the 
number of argumentative elements, and number of claims and non- 
claims for the datasets in Table 2. 
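
The 70/30 split described above can be sketched as follows. This is a minimal illustration in Python, not the R workflow used in the study. The function name `stratified_split`, the fixed seed, and the choice to sample within each label are assumptions, since the text states only that elements were randomly selected. The toy label counts are summed from Table 2.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), sampling train_fraction of each label.

    Stratifying by label is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the class counts of the suprasentential data set
# (906 claims and 1358 non-claims, summed from Table 2).
labels = ["claim"] * 906 + ["non-claim"] * 1358
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the rounding rule here is hypothetical, the resulting set sizes need not match those reported in Table 2 exactly.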


Table 2. Numbers of elements, claims and non-claims for the
training and test sets


Data set                        Number of   Number of   Number of
                                elements    claims      non-claims
Suprasentential training set    1594        639         955
Suprasentential test set        670         267         403
Sentential training set         4401        935         3466
Sentential test set             1925        409         1516


2.4 Features 
2.4.1 Word count 


We extracted the number of words for each claim and non-claim at 
the sentential and suprasentential level. 


2.4.2 N-gram frequency 

We extracted n-grams and the POS combinations of these n-grams 
for both claims and non-claims. We assume that some n-grams (or 
POS n-grams) are more likely to identify claims versus non-claims 
(and vice versa), and the frequency of these key n-grams (or POS 
n-grams) could serve as a good indicator of the type of an
argumentative element or sentence. We used keyness values [21] 
as the measurement of importance of the n-grams or POS n-grams 
in claims and non-claims. Keyness values can provide evidence of 
whether n-grams and POS n-grams are more common in one corpus 
as compared with the other corpus. In the current study, we treated 
the claims and non-claims as two separate corpora. 


Raw and normalized frequency (i.e., normalized by the total 
number of words in all claims and non-claims, respectively) for 
each n-gram (or POS n-gram) that occurred both in claims and non- 
claims were calculated. The keyness value of each n-gram was also 
calculated based on the frequency data following Rayson and 
Garside’s guidelines [40]. Specifically, if an n-gram or POS n-gram 
had a keyness value greater than 3.84 (equivalent to p < 0.05), and 
if it had a higher normalized frequency in claims, it was considered 
more likely to occur in claims over non-claims, and vice versa. The 
range of the n-grams and POS n-grams was from unigrams to seven-grams.
NLTK [6] was used to tokenize the texts into n-grams and
label the POS for the n-grams. For example, the phrases
should be, would be, can be, and will be were all converted to the same
POS n-gram: MD (modal) + VB (verb, base form). We did
not remove stopwords before n-gram tokenization. For each
suprasentential and sentential argumentative element in the training
and test sets, we calculated the frequency of each type of
significant n-gram or POS n-gram (e.g., bigrams that were
significant in claims) and normalized the frequency by the length
(word count) of the element.
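
The keyness procedure described above can be sketched as follows. This is a minimal illustration, assuming the standard log-likelihood statistic associated with Rayson and Garside and the 3.84 cut-off given in the text. The function names (`ngrams`, `log_likelihood`, `key_ngrams`, `key_ngram_rate`) and the representation of each element as a list of tokens are hypothetical. The study used NLTK for tokenization and POS tagging, whereas this sketch takes pre-tokenized input so that it has no external dependencies.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def log_likelihood(freq_a, freq_b, total_a, total_b):
    """Log-likelihood keyness of an item seen freq_a times in a corpus of
    total_a words and freq_b times in a corpus of total_b words."""
    combined = freq_a + freq_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2.0 * score


def key_ngrams(claims, non_claims, n, threshold=3.84):
    """claims / non_claims: lists of token lists, one list per element.
    Returns (keys_for_claims, keys_for_non_claims) as sets of n-gram tuples."""
    claim_counts = Counter(g for toks in claims for g in ngrams(toks, n))
    other_counts = Counter(g for toks in non_claims for g in ngrams(toks, n))
    claim_words = sum(len(toks) for toks in claims)
    other_words = sum(len(toks) for toks in non_claims)

    keys_for_claims, keys_for_non_claims = set(), set()
    # Only n-grams that occur in both corpora are scored, as in the text.
    for gram in claim_counts.keys() & other_counts.keys():
        a, b = claim_counts[gram], other_counts[gram]
        if log_likelihood(a, b, claim_words, other_words) > threshold:
            if a / claim_words > b / other_words:
                keys_for_claims.add(gram)
            else:
                keys_for_non_claims.add(gram)
    return keys_for_claims, keys_for_non_claims


def key_ngram_rate(tokens, key_set, n):
    """Frequency of key n-grams in one element, normalized by word count."""
    if not tokens:
        return 0.0
    return sum(1 for g in ngrams(tokens, n) if g in key_set) / len(tokens)
```

The same functions apply to POS n-grams if each token list is replaced by the corresponding list of POS tags.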


2.4.3 Positionality of the elements 

Beyond n-gram frequency, studies have shown that the position of
argumentative elements is an indicator of their structural function 
[e.g., 7, 8, 10]. In this study, two types of normalized positional 
variables for each argumentative element or sentence were 
calculated as positionality features. 


Normalized element or sentence position in an essay was computed
as the ratio of the element/sentence position in an essay to the
number of elements/sentences in the essay (e.g., if an
argumentative element or a sentence was the 5th element or
sentence in an essay of 10 elements/sentences in total, the value of
this variable would be 5 divided by 10, or 50%). The normalized
position of the element or sentence in a paragraph was computed as
the ratio of the element/sentence position in a paragraph to the total
number of elements/sentences in that paragraph. That is, if an
argumentative element or a sentence was the 2nd element (sentence)
in a paragraph containing 5 elements (sentences) in total,
the value would be 2 divided by 5, or 40%.
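
The two positionality features can be sketched as follows. This is a minimal illustration that reproduces the worked examples in the text. The function name `normalized_positions` and the input format (an essay as a list of paragraphs, each a list of units) are assumptions, and the sketch assumes each unit is assigned to exactly one paragraph.

```python
def normalized_positions(paragraphs):
    """paragraphs: list of paragraphs, each a list of units (elements or
    sentences). Returns one (unit, essay_position, paragraph_position)
    tuple per unit, with both positions expressed as ratios in (0, 1]."""
    total_units = sum(len(paragraph) for paragraph in paragraphs)
    rows = []
    essay_index = 0
    for paragraph in paragraphs:
        for paragraph_index, unit in enumerate(paragraph, start=1):
            essay_index += 1
            rows.append((
                unit,
                essay_index / total_units,          # position in the essay
                paragraph_index / len(paragraph),   # position in the paragraph
            ))
    return rows


# An essay of 10 units in two paragraphs of 5 units each.
essay = [[f"u{i}" for i in range(1, 6)], [f"u{i}" for i in range(6, 11)]]
rows = normalized_positions(essay)
print(rows[4])  # the 5th unit of the essay
print(rows[1])  # the 2nd unit of the first paragraph
```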


2.4.4 Other lexical, syntactic, and semantic features 
To explore whether additional lexical, syntactic, cohesion, and 
cognitive text features increased the accuracy in identifying claims 
and non-claims, we extracted 925 features for each of the 
argumentative elements. These features were extracted using the 
Suite of Automatic Linguistic Analysis Tools (SALAT) [14, 15, 25, 
24]. SALAT includes multiple NLP tools including TAACO (Tool 
for the Automatic Analysis of Cohesion), TAALES (Tool for the 
Automatic Analysis of Lexical Sophistication), TAASSC (Tool for
the Automatic Analysis of Syntactic Sophistication and 
Complexity), and SEANCE (Sentiment Analysis and Cognition 
Engine). Two-sample t-tests or Wilcoxon’s tests were conducted 
using the variables after removing SALAT variables that were not 
normally distributed. We then removed those variables where the 
results of t-test or Wilcoxon’s test were not significant between the 
group of claims and non-claims. Finally, by visual inspection, 20 
out of 131 variables that were relevant to argumentative elements 
were selected. Hand selection of variables was done to avoid 
problems of overfitting. The selected NLP features and their 
descriptions are presented in Appendix B. 


2.4.5 Feature reduction 

To avoid multicollinearity, we conducted correlation analyses 
among all the derived features (one versus all) for the two training 
sets, respectively. If two or more variables correlated with r > 
0.699, the variable(s) with the lower correlation with the category 
of the argumentative element/sentence were removed, and the 
variable with the higher correlation was retained. The feature 
reduction process was done on the two training sets first and then 
applied to the test sets. After feature reduction, the frequency 
features that were retained included word count (of the 
argumentative element or sentence), the frequency of the 
significant unigram in claims and in non-claims, bigrams and quad- 
grams in claims, and the frequency of significant POS unigrams, 
trigrams, four-grams, five-grams in claims and in non-claims, and 
frequency of significant six-grams in claims. The two positionality 
features and the selected 20 SALAT features were also retained. 
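
The correlation-based reduction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say whether the 0.699 cut-off applies to signed or absolute correlations, nor how groups of more than two correlated variables were resolved. The sketch uses absolute Pearson correlations and a greedy pass in order of correlation with the class label. The names `pearson` and `reduce_features` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def reduce_features(features, target, threshold=0.699):
    """features: dict mapping feature name to a list of values.
    target: list of 0/1 class labels (1 = claim).
    Visits features from most to least correlated with the target and
    keeps one only if it is not collinear with a feature already kept."""
    relevance = {name: abs(pearson(values, target))
                 for name, values in features.items()}
    kept = []
    for name in sorted(features, key=lambda k: relevance[k], reverse=True):
        if all(abs(pearson(features[name], features[other])) <= threshold
               for other in kept):
            kept.append(name)
    return kept


# Toy data: "b" is nearly collinear with "a" and less related to the label.
toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [1, 0, 1, 0, 1, 0],
}
kept = reduce_features(toy, target=[0, 0, 0, 1, 1, 1])
print(kept)
```

Fitting the selection on the training set only, and then applying the retained feature list to the test set, mirrors the order of operations described in the text.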
Table 3. Model accuracy results


Classifier              Model                                          Accuracy  Kappa   Label      Precision  Recall  F1
Logistic Regression     Suprasentential - Frequency and positionality  0.852     0.691   Non-Claim  0.874      0.881   0.878
                                                                                         Claim      0.818      0.809   0.814
                        Suprasentential - Full features                0.845     0.675   Non-Claim  0.867      0.876   0.872
                                                                                         Claim      0.810      0.798   0.804
                        Sentential - Frequency and positionality       0.802     0.216   Non-Claim  0.817      0.965   0.885
                                                                                         Claim      0.604      0.198   0.298
                        Sentential - Full features                     0.800     0.244   Non-Claim  0.823      0.951   0.882
                                                                                         Claim      0.569      0.242   0.340
Naive Bayes             Suprasentential - Frequency and positionality  0.769     0.485   Non-Claim  0.747      0.931   0.829
                                                                                         Claim      0.833      0.524   0.644
                        Suprasentential - Full features                0.819     0.618   Non-Claim  0.831      0.878   0.854
                                                                                         Claim      0.799      0.730   0.763
                        Sentential - Frequency and positionality       0.791     0.267   Non-Claim  0.834      0.925   0.878
                                                                                         Claim      0.515      0.301   0.380
                        Sentential - Full features                     0.789     0.271   Non-Claim  0.833      0.916   0.872
                                                                                         Claim      0.506      0.318   0.390
K-Nearest Neighbors     Suprasentential - Frequency and positionality  0.836     0.650   Non-Claim  0.835      0.906   0.869
                                                                                         Claim      0.837      0.730   0.780
                        Suprasentential - Full features                0.787     0.526   Non-Claim  0.760      0.943   0.842
                                                                                         Claim      0.865      0.551   0.673
                        Sentential - Frequency and positionality       0.818     0.286   Non-Claim  0.827      0.973   0.894
                                                                                         Claim      0.709      0.245   0.364
                        Sentential - Full features                     0.804     0.196   Non-Claim  0.813      0.976   0.887
                                                                                         Claim      0.654      0.166   0.265
Support Vector Machine  Suprasentential - Frequency and positionality  0.863     0.714   Non-Claim  0.886      0.886   0.886
                                                                                         Claim      0.828      0.828   0.828
                        Suprasentential - Full features                0.833     0.652   Non-Claim  0.865      0.856   0.860
                                                                                         Claim      0.786      0.798   0.792
                        Sentential - Frequency and positionality       0.818*    0.336*  Non-Claim  0.839      0.951   0.891
                                                                                         Claim      0.639      0.325   0.431
                        Sentential - Full features                     0.822     0.320   Non-Claim  0.833      0.968   0.896
                                                                                         Claim      0.706      0.281   0.402
Random Forest           Suprasentential - Frequency and positionality  0.873     0.734   Non-Claim  0.886      0.906   0.896
                                                                                         Claim      0.853      0.824   0.838
                        Suprasentential - Full features                0.866     0.720   Non-Claim  0.890      0.886   0.888
                                                                                         Claim      0.829      0.835   0.832
                        Sentential - Frequency and positionality       0.832     0.419   Non-Claim  0.858      0.943   0.898
                                                                                         Claim      0.664      0.421   0.515
                        Sentential - Full features                     0.829     0.390   Non-Claim  0.850      0.951   0.897
                                                                                         Claim      0.672      0.377   0.483

* Recomputed from the reported precision and recall and the sentential test-set class counts in Table 2.


To examine whether adding the SALAT features improved the 
accuracy of claim identification, we created two versions of the 
feature sets. The first version comprised the n-gram frequency 
(including word count) features and positionality features, and the 
second version comprised all the features (including the SALAT 
NLP features). Combined with the different levels of discourse 
units (sentential and suprasentential), four pairs of datasets 
(training and test sets) were prepared for modeling: the frequency 
and positionality versions along with the full feature versions at 
both the sentential and suprasentential levels. 


2.4.6 Classifiers 

We used the ‘caret’ [23], ‘randomForest’ [28], ‘e1071’ [32], and 
‘tidyverse’ packages [48] in R [13] to apply Logistic Regression, 
Naive Bayes, K-Nearest Neighbors, Support Vector Machines, and 
Random Forest models. Ten-fold cross-validation with five repeats
was used. We trained and tested the four versions of the data separately.


For the SVM classifier, linear, polynomial, and radial kernels were
applied. The model with the best performance was selected to make
predictions on the test set.
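
The resampling scheme (ten folds, five repeats) can be sketched as follows. This is a minimal illustration in Python of how the training indices could be partitioned. The study itself used the 'caret' package in R, and the generator name `repeated_kfold`, the seed, and the unstratified shuffling are assumptions made for this sketch.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=5, seed=0):
    """Yield (repeat, fold, training_idx, validation_idx) for each of the
    n_repeats * n_folds resampling iterations."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for held_out in range(n_folds):
            validation_idx = folds[held_out]
            training_idx = [i for f, fold in enumerate(folds)
                            if f != held_out for i in fold]
            yield repeat, held_out, training_idx, validation_idx


# 10 folds x 5 repeats gives 50 fits per candidate model.
splits = list(repeated_kfold(n_samples=100))
print(len(splits))
```

Each candidate model (for example, each SVM kernel) would be fit on `training_idx` and scored on `validation_idx`, and the scores averaged over all iterations before the best model is applied to the held-out test set.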


3. RESULTS 


3.1 Model evaluation 

The classification performances (precision, recall, F1 scores,
accuracy, and Cohen’s kappa) of the multiple models on the test 
sets are reported in Table 3. 


Overall, the models developed on frequency and positionality 
features slightly outperformed the models developed using all the 
features. This indicates that adding lexical, syntactic, cohesion, and 
cognitive NLP features does not improve the accuracy of the 
classification of claims and non-claims. In terms of the selection of 
the unit of classification, the suprasentential models outperformed 
the sentential models. Finally, the suprasentential Random Forest
model based on frequency and positionality features yielded the
best accuracy (0.873) and Kappa (0.734), followed by the
suprasentential Random Forest model based on the full feature set,
which yielded an accuracy of 0.866 and Kappa of 0.720. These values
represent good performance based on the scale of Cohen’s Kappa values [11].
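
The relationship between the metrics in Table 3 can be illustrated with a short worked example. The four confusion-matrix counts below are not reported in the paper. They are back-calculated from the precision and recall of the best model in Table 3 and the test-set class sizes in Table 2 (267 claims, 403 non-claims), and should be read as an assumption. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, Cohen's kappa, and precision/recall/F1 for the positive
    class, computed from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Chance agreement: product of the gold and predicted marginals,
    # summed over both classes.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "kappa": kappa,
            "precision": precision, "recall": recall, "f1": f1}


# Assumed counts for the suprasentential Random Forest model based on
# frequency and positionality features ("claim" is the positive class).
metrics = binary_metrics(tp=220, fp=38, fn=47, tn=365)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, the rounded values agree with the Claim row reported for this model in Table 3.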


3.2 Important variables 

Variable importance for the best model (the suprasentential 
Random Forest model based on word count, n-gram frequency and 
positionality features) was reported by the ‘caret’ package. Table 4 
shows the top 10 important variables and their importance values 
for this model. 


The variable importance values showed that the length (word 
count) of an argumentative element, the normalized position of the 
argumentative element in the essay, and the frequency of 
significant bigrams in claims in the argumentative element are the 
three most important variables. 


Table 4. Variable importance values


Variable                                                 Importance value
Word Count                                               289.988
Normalized element position in the essay                 162.083
Frequency of significant bigrams in claims               47.992
Frequency of significant unigrams in claims              31.147
Normalized element position in the paragraph             29.791
Frequency of significant POS five grams in claims        28.465
Frequency of significant POS four grams in claims        27.389
Frequency of significant unigrams in non-claims          25.399
Frequency of significant POS unigrams in claims          25.272
Frequency of significant POS unigrams in non-claims      23.812
Frequency of significant POS trigrams in claims          20.364
Frequency of significant POS trigrams in non-claims      18.375
Frequency of significant POS four grams in non-claims    13.745
Frequency of significant four grams in claims            8.676
Frequency of significant POS six grams in claims         8.490
Frequency of significant POS five grams in non-claims    4.210


4. ERROR ANALYSES AND DISCUSSION 


We conducted error analyses for the two Random Forest 
suprasentential models (i.e., the models based on the frequency and 
positionality feature set and the full feature set). Our goal was to 
examine the misclassifications of the models to better understand 
elements that may contribute to model accuracy. 


We first examined classification rates. Among all incorrectly
classified instances, we found more cases in which a claim was
misclassified as a non-claim, whereas non-claims were less
frequently misclassified as claims. For both models, around 17% of
claims were misclassified as non-claims, and around 10% of
non-claims were misclassified as claims. These results indicate that the
models are better at identifying non-claims than claims, potentially
due to the imbalanced data between the claims and non-claims.
Nevertheless, future studies should examine if there are more
representative features in claims that can be integrated into our
current feature set.


We next examined if essay quality and length influenced the model 
accuracy. Specifically, for each argumentative element in the two 
suprasentential test sets, we extracted the following information: 
holistic score, number of words, number of sentences, and number 
of paragraphs in the essay where the argumentative element 
occurred. We examined differences between the argumentative 
elements that were correctly and incorrectly predicted for these 
features using t-tests. No differences were reported for essay 
quality and length in either model. Thus, the classification of 
argumentative elements was not related to the quality or the length 
of essays. 


We also examined if differences in model accuracy were related to 
more specific argumentation categories (i.e., micro-categories). As 
mentioned in Section 2.2, we merged the argumentation categories 
of Primary Claim, Final Claim, Counterclaim, and Rebuttal from 
the original annotated corpus into a larger classification of claims 
(i.e., a macro-classification). We also classified the remaining 
categories of Data and Concluding Summary along with Non- 
annotated texts into non-claims. To assess whether the micro- 
categories influenced classification of the macro-classification, we 
compared the prediction accuracies among the seven micro- 
categories. 


The results showed that Counterclaims were not misclassified in
either model (likely because of their rarity). Concluding Summaries
were not misclassified in the frequency- and positionality-based
model but were misclassified 3.9% of the time in the full feature model.
Data was misclassified around 9% of the time in both models. Meanwhile,
the micro-categories that were more frequently misclassified included
Primary Claims (around 14% misclassified), Final Claims (around
21% misclassified), Non-annotated texts (around 22%
misclassified), and Rebuttals (2 out of 3 instances, or 66.7%,
misclassified in both models). These results were also in line with the
finding that claims were more frequently misclassified as non-claims.
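
The per-category comparison can be sketched as follows. This is a minimal illustration, assuming each test instance is stored as a (micro-category, gold label, predicted label) triple. The record layout and the function name `misclassification_rates` are hypothetical, and the toy records only mimic the Rebuttal case (2 of 3 misclassified).

```python
from collections import Counter


def misclassification_rates(records):
    """records: iterable of (micro_category, gold_label, predicted_label).
    Returns the share of misclassified instances per micro-category."""
    totals, errors = Counter(), Counter()
    for category, gold, predicted in records:
        totals[category] += 1
        if gold != predicted:
            errors[category] += 1
    return {category: errors[category] / totals[category]
            for category in totals}


# Toy records, not the study's data.
records = [
    ("Rebuttal", "claim", "non-claim"),
    ("Rebuttal", "claim", "non-claim"),
    ("Rebuttal", "claim", "claim"),
    ("Counterclaim", "claim", "claim"),
]
rates = misclassification_rates(records)
print({category: f"{rate:.1%}" for category, rate in rates.items()})
```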


To further explore what factors affect the misclassifications among
the micro-categories of argumentative types, Welch’s t-tests were
conducted on all NLP features (see Appendix B) used in the
full analysis, comparing correct and incorrect classification instances.
However, the analysis was not done for the micro-category of
Counterclaim since all instances under this category were correctly
predicted by the two models. Also, we did not conduct t-tests for
the micro-category of Rebuttal due to its small sample size (N = 3).
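
Welch's t-test, which does not assume equal variances in the two groups, can be sketched as follows. This is a minimal illustration that computes only the test statistic and the Welch-Satterthwaite degrees of freedom. Obtaining a p-value additionally requires the t distribution from a statistics library, which is omitted here. The function name `welch_t` and the toy samples are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Toy example: one feature (e.g., word count) for correctly and
# incorrectly classified instances of a single micro-category.
correct = [1, 2, 3, 4, 5]
incorrect = [2, 4, 6, 8, 10]
t, df = welch_t(correct, incorrect)
print(round(t, 3), round(df, 3))
```

In the analysis described above, one such test would be run per feature and per micro-category.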


Table 5 presents the features for which significant differences were 
found between the correct and incorrect classification instances in 
at least two categories of argumentative types. In general, the 
classification of Primary Claim, Data, Concluding Summary, and 
Non-annotated texts seemed to be more strongly influenced by 
linguistic features. Word count was the strongest indicator of
misclassification, with differences found for each micro-category.
The standard deviation of dependents per object of
prepositions was another strong predictor of misclassification, 
which reflects the development of syntactic complexity [25]. 
Table 5. Features with significant differences between correct and incorrect classification instances


                      Significant features    Significant features
Category              among features 1-5      among features 6-14
Primary claim         3                       5
Final claim           1                       2
Data                  5                       6
Concluding summary    2                       5
Nonannotated          3                       4


Note. Counts give the number of features for which a significant difference (p < .05) was found between the correct and incorrect instances. 1 =
Number of named entities, 2 = Word count, 3 = Normalized element position in the paragraph, 4 = Normalized element position in the essay,
5 = Frequency of significant unigrams in claims, 6 = Frequency of significant POS trigrams in claims, 7 = Frequency of significant
quad-grams in claims, 8 = Hu Liu proportion score, 9 = Objects component score, 10 = Brown frequency score, 11 = Bigram lemma type-token
ratio, 12 = Nouns as modifiers score, 13 = Dependents per object of the preposition (SD), 14 = T-units per sentence.


The number of named entities was a strong indicator for the
non-claims, wherein the incorrect instances of non-claims contained
fewer named entities than the correct instances. The nouns as
modifiers score, which measures the use of nouns as nominal
modifiers in general and the variation in the number of modifiers
per nominal [25], was also predictive of misclassification. Other
linguistic features that influenced the classification accuracy
included the normalized position of the element in the paragraph and
in the essay, the bigram lemma type-token ratio, the frequency of key
unigrams, quad-grams, and POS trigrams in claims, the number of
T-units per sentence, the number of terms that reference objects,
the proportion of words with positive sentiment to words with
negative sentiment, and the mean frequency score based on the
London-Lund Corpus of Conversation.


5. CONCLUSION 


In this study, we proposed an approach that combined the 
frequency, positionality, and other lexical, syntactic, cohesion, and 
cognitive NLP features to predict claims and non-claims in 
argumentative essays. Our model performed well in the 
classification of these argumentative elements. Our exploration of 
the features, the comparison between sentential and
suprasentential models, and investigation of the factors that
influenced classification accuracy in the error analyses should 
contribute to the field of automated identification and evaluation of 
discourse elements in argumentative writing. 


It is important to note that the corpus used for this study was 
relatively small, comprising 314 student essays. Thus, to gain 
higher accuracies and reliabilities in classifying argumentative 
elements, we plan on annotating more essays and expanding the 
current corpus. That also means we will use essays written to more 
prompts allowing us to extract key n-grams and POS n-grams that 
are more generic and less restricted to the specific prompts used 
here. In addition, due to the small sample size, our classification of 
argumentative elements was simplified to focus on claims versus 
non-claims. We are interested in exploring the classification of the 
micro-categories (Primary Claim, Final Claim, Counterclaim,
Rebuttal, Data, and Concluding Summary) in a larger corpus. We 
also plan to include the prediction of the quality of these 
argumentative elements in students’ writing. 


The models developed in this study will be included in an online 
Writing Assessment Tool (WAT). With the classification
algorithm implemented, WAT’s automatic writing evaluation
(AWE) system will have the capacity to predict the number of
claims in the essay and whether the claims mention the key n-grams
that reflect the argument topic. This will make it possible to provide
feedback to students on argumentation quality within their essays. The
study also provides insight into the length, position, content (e.g., 
the key n-grams), and other NLP features in claims versus non- 
claims in students’ writing, which will contribute to finer-grained 
feedback components in our AWE system. 
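As a rough illustration of the second capability, the following Python sketch checks whether a sentence already labeled as a claim contains any of a set of key n-grams. The function names, the simple regular-expression tokenizer, and the example key n-grams are assumptions made for illustration only; this is not the WAT implementation, and the key n-grams in the study were extracted from the corpus rather than written by hand.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mentions_key_ngram(claim, key_ngrams):
    """Return True if the claim contains at least one of the key n-grams."""
    tokens = tokenize(claim)
    for key in key_ngrams:
        key_tokens = tuple(tokenize(key))
        if key_tokens and key_tokens in ngrams(tokens, len(key_tokens)):
            return True
    return False


# Hypothetical key n-grams for a prompt about thinking seriously.
KEY_NGRAMS = ["think seriously", "important matters"]

claims = [
    "In my opinion, every individual has an obligation to think seriously.",
    "People always strive to be unique or different.",
]

# One flag per claim: does it mention the argument topic?
flags = [mentions_key_ngram(claim, KEY_NGRAMS) for claim in claims]
```

A feedback component could then report how many detected claims have their flag set, under the assumption that claim detection has already been run on the essay.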


This study also provides important information for others who are 
developing AWE algorithms to drive feedback on argumentative 
essays or, more broadly, who seek to better understand the use of 
claims in essays. Specifically, the results inform feedback that can 
be provided to students about the number of claims they make, 
whether their claims mention the argument topic, how to better 
position argumentative elements within their essays, and how to pay 
attention to specific linguistic features (such as the use of named 
entities when giving evidence) in their writing. This is an important 
achievement in the realm of writing feedback given the crucial need 
to automate feedback to students on their use of claims and 
evidence in argumentative essays. 


Another important contribution of this study is a new corpus of 
essays annotated for argumentative elements, which is publicly 
available at linguisticanalysistools.org. This corpus includes 
theoretically aligned argumentative elements that complement 
existing corpora [44, 45], and it adds new components including 
prompts, holistic scores, additional categories of argumentation, 
and different educational settings. As such, this study gives other 
scientists the opportunity to build upon our work so that we can 
better understand writing and the features related to successful 
composition. 
APPENDIX 


A. Definitions of argumentative elements 


Element: Final Claim 
Definition: An opinion or conclusion on the main question. 
Example: In my opinion, every individual has an obligation to think seriously about important matters, although this might be difficult. 

Element: Primary Claim 
Definition: A claim that supports the final claim. 
Example: The next reason why I agree that every individual has an obligation to think seriously about important matters is that this simple task can help each person get ahead in life and be successful. 

Element: Counterclaim 
Definition: A claim that refutes another claim or gives an opposing reason to the final claim. 
Example: Some may argue that obligating every individual to think seriously is not necessary and even annoying as some people may choose to just follow the great thinkers of the nation. 

Element: Rebuttal 
Definition: A claim that refutes a counterclaim. 
Example: Even though people can follow others’ steps without thinking seriously in some situations, the ability to think critically for themselves is a very important survival skill. 

Element: Data 
Definition: Ideas or examples that support primary claims, counterclaims, or rebuttals. 
Example: For instance, the presidential debate is currently going on. In order to choose the right candidate, voters need to research all sides of both candidates and think seriously to make a wise decision for the good of the whole nation. 

Element: Concluding Summary 
Definition: A concluding statement that restates the claims. 
Example: [...] thinking seriously is important in making decisions because each decision has an outcome that affects lives. It is also important because if you think seriously it can help you succeed. 

Element: Non-annotated 
Definition: Any text that doesn’t fall into any of the above categories. 
Example: People always strive to be unique or different. This idea clashes with creativeness all through our lives. 


B. Descriptions of the SALAT NLP features 


Feature: Bigram lemma type-token ratio 
Description: Number of unique bigram lemmas (types) divided by the number of total bigram lemmas (tokens) 

Feature: Brown frequency score 
Description: Mean word frequency score based on the London-Lund Corpus of Conversation 

Feature: Brysbaert concreteness score 
Description: Sum of concreteness scores based on all words divided by the number of words with concreteness scores 

Feature: COCA academic bigram association strength 
Description: Sum of approximate collexeme strength scores divided by the number of bigrams in the text with collexeme scores 

Feature: Dependents per clause (SD) 
Description: The standard deviation of the total number of dependents per clause 

Feature: Dependents per object of the preposition (SD) 
Description: This score captures the variation (standard deviation) in the prepositional objects 

Feature: Direct objects per clause 
Description: The number of direct objects per clause 

Feature: Free association tokens response score 
Description: Number of response tokens elicited by a word as stimulus in a discrete word association experiment (based on function words) 

Feature: Hu Liu proportion score 
Description: Proportion of the number of words with positive sentiments to the words with negative sentiments 

Feature: LDA age of exposure score 
Description: Based on Incremental Age of Exposure for words across 13 grade levels; calculated based on 1/slope of linear regression 

Feature: Lexical decision time 
Description: Standardized lexical decision reaction time across all participants for this word (z-score, based on function words) 

Feature: Nouns as modifiers score 
Description: This score captures the use of nouns as modifiers and modifier variation 

Feature: Number of named entities 
Description: The number of named entities 

Feature: Number of prepositions per clause 
Description: This score captures noun phrase elaboration and clause complexity 

Feature: Objects component score 
Description: This component score represents the number of terms that reference objects 

Feature: Possessives component score 
Description: This component score captures the use of possessives in general, and specifically captures the use of possessives in nominal subjects, direct objects, and prepositional objects 

Feature: Sentiment score of dominance 
Description: This score captures the sentiment of dominance, measured by the number of words of dominance 

Feature: Sentiment score of overstating 
Description: This score captures the sentiment of overstating, calculated based on words indicating emphasis in realms of frequency, causality, accuracy, validity... 

Feature: T-units per sentence 
Description: Number of T-units in text divided by number of sentences in text 

Feature: Verb argument constructions association strength 
Description: Average approximate collostructional strength score based on the COCA academic corpus 

Note. For more information about the SALAT NLP features, please see https://www.linguisticanalysistools.org/ 
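
Several of the SALAT features described above are simple ratios over counts. The Python sketch below shows how two of them could be computed from a list of tokens: the bigram type-token ratio and the positive-to-negative sentiment proportion. It is a minimal sketch rather than the SALAT implementation: it works on raw word forms instead of lemmas, the word sets are hypothetical stand-ins for a sentiment lexicon, and the handling of empty inputs is an assumption.

```python
def bigram_type_token_ratio(tokens):
    """Unique bigrams (types) divided by total bigrams (tokens)."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0  # Assumption: texts with fewer than two tokens score 0.
    return len(set(bigrams)) / len(bigrams)


def sentiment_proportion(tokens, positive_words, negative_words):
    """Count of positive words divided by count of negative words."""
    positive = sum(1 for token in tokens if token in positive_words)
    negative = sum(1 for token in tokens if token in negative_words)
    if negative == 0:
        return None  # Assumption: undefined when no negative words occur.
    return positive / negative


# Six bigrams in total, four of them unique.
tokens = "the cat sat on the cat sat".split()
ttr = bigram_type_token_ratio(tokens)

# Hypothetical word sets standing in for a sentiment lexicon.
POSITIVE = {"good", "wise"}
NEGATIVE = {"annoying"}
words = "a wise and good choice is not annoying".split()
ratio = sentiment_proportion(words, POSITIVE, NEGATIVE)
```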